The following command prints “Hello, world!”:
print("Hello, world!")
## [1] "Hello, world!"
Loading the necessary library: tidyverse
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
##Load Data
load("college.Rdata")
college.Rdata dataset (variable name: definition)
instnm: Institution Name
stabbr: State Abbreviation
year: Year
control: Control of institution, 1 = public, 2 = private non-profit, 3 = private for-profit
preddeg: Predominant degree, 1 = certificate, 2 = associate’s, 3 = bachelor’s, 4 = graduate
adm_rate: Proportion of Applicants Admitted
sat_avg: Midpoint of entrance exam scores, on SAT scale, math and verbal only
costt4_a: Average cost of attendance (tuition and room and board less all grant aid)
debt_mdn: Median debt of graduates
md_earn_wne_p6: Earnings of graduates who are not enrolled in higher education, six years after graduation
ugds: number of undergraduates
The following code filters for schools whose admission rate is greater than 30%, a threshold I treat as marking the least selective schools. For those schools, I had R calculate the mean of md_earn_wne_p6, the earnings of graduates six years after graduation.
Then, I filtered for schools whose admission rate is less than 10%, which I treat as the most selective. For this group, R calculated the same mean of graduate earnings six years after graduation.
## What's the average earnings for individuals at the least selective schools?
sc%>%filter(adm_rate>.3)%>%summarize(md_earn_wne_p6=mean(md_earn_wne_p6,na.rm=TRUE))
## # A tibble: 1 x 1
## md_earn_wne_p6
## <dbl>
## 1 34747.
## What's the average earnings for individuals at the most selective schools?
sc%>%filter(adm_rate<.1)%>%summarize(md_earn_wne_p6=mean(md_earn_wne_p6,na.rm=TRUE))
## # A tibble: 1 x 1
## md_earn_wne_p6
## <dbl>
## 1 53500
Answer: The average earnings for individuals at the most and least selective colleges are 53,500 and about 34,747, respectively.
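The two pipelines above differ only in the filter condition. A minimal sketch of the same filter-then-summarize pattern, run on a small made-up data frame so it works without college.Rdata (the names `toy`, `least`, `most`, and `mean_earn` and all values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data; values are illustrative only, not from college.Rdata
toy <- data.frame(
  instnm = c("A", "B", "C", "D"),
  adm_rate = c(0.05, 0.08, 0.45, 0.70),
  md_earn_wne_p6 = c(60000, NA, 30000, 40000),
  stringsAsFactors = FALSE
)

# Mean earnings among schools admitting more than 30% of applicants
least <- toy %>%
  filter(adm_rate > 0.3) %>%
  summarize(mean_earn = mean(md_earn_wne_p6, na.rm = TRUE))

# Mean earnings among schools admitting fewer than 10%;
# na.rm = TRUE drops the missing value for school B
most <- toy %>%
  filter(adm_rate < 0.1) %>%
  summarize(mean_earn = mean(md_earn_wne_p6, na.rm = TRUE))

least$mean_earn  # 35000
most$mean_earn   # 60000
```

Without `na.rm = TRUE`, a single missing earnings value in a group would turn that group's mean into `NA`.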
College size can be measured with the variable ugds, the number of undergraduates. Dell’Arte International School of Physical Theatre has the smallest undergraduate program, with only 27 students. Florida International University has the largest, with 30,920 students. To determine whether colleges with high SAT scores tend to be larger or smaller than colleges with low SAT scores, I created a scatterplot of school size against average SAT score.
The following code uses ggplot to create a graphics object, which I will name gg. Within it, I declare the dataset (‘sc’), the x variable (‘sat_avg’), and the y variable (‘ugds’). I also specify the institution name (‘instnm’) as text. The last couple of lines make the scatterplot interactive, so that hovering the mouse cursor over a point shows which university it corresponds to.
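One way the smallest and largest schools could be looked up is sketched below. It assumes nothing about how the figures above were actually obtained. The data frame `size_demo` is a hypothetical stand-in for `sc`, holding four enrollment figures quoted in this document.

```r
library(dplyr)

# A few enrollment figures quoted in this document; sc itself is not reproduced here
size_demo <- data.frame(
  instnm = c("Dell'Arte International School of Physical Theatre",
             "California Institute of Technology",
             "University of Pennsylvania",
             "Florida International University"),
  ugds = c(27, 951, 10842, 30920),
  stringsAsFactors = FALSE
)

# Keep only the rows holding the minimum and the maximum enrollment
extremes <- size_demo %>%
  filter(ugds == min(ugds, na.rm = TRUE) | ugds == max(ugds, na.rm = TRUE)) %>%
  arrange(ugds)

extremes
```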
gg<-ggplot(data=sc, aes(x=sat_avg,y=ugds,text=instnm))
gg<-gg+geom_point(alpha=.5,size=.5)
gg<-gg+xlab("Average SAT")+ylab("Number of Undergraduates")
gg<-gg+ggtitle("Average SAT and Number of Undergraduates")
gg
## Warning: Removed 27 rows containing missing values (geom_point).
library(plotly) # ggplotly() comes from the plotly package
gg_p<-ggplotly(gg)
gg_p
The scatterplot gives me an overall view of how college size varies with average SAT score in our dataset. From this, I decided to filter the dataset for the universities with the highest and the lowest SAT scores. Based on the scatterplot, I define colleges with the highest SAT scores as those whose average SAT score is greater than 1400, and colleges with the lowest SAT scores as those whose average SAT score is less than 900.
sc%>%filter(sat_avg>1400)%>%select(instnm,sat_avg,ugds)%>%arrange(-ugds)
## # A tibble: 20 x 3
## instnm sat_avg ugds
## <chr> <dbl> <int>
## 1 University of Pennsylvania 1436 10842
## 2 Northwestern University 1427 8905
## 3 University of Notre Dame 1433 8367
## 4 Columbia University in the City of New York 1445 7743
## 5 Harvard University 1468 7181
## 6 Emory University 1403 6868
## 7 Vanderbilt University 1430 6764
## 8 Stanford University 1436 6564
## 9 Washington University in St Louis 1462 6436
## 10 Duke University 1440 6416
## 11 Brown University 1420 6013
## 12 Yale University 1475 5258
## 13 Tufts University 1450 5146
## 14 University of Chicago 1425 5101
## 15 Princeton University 1482 5029
## 16 Massachusetts Institute of Technology 1472 4218
## 17 Dartmouth College 1432 4090
## 18 Rice University 1425 3279
## 19 Williams College 1424 2033
## 20 California Institute of Technology 1514 951
Of the 20 colleges whose average SAT score is higher than 1400, most have a fairly large number of undergraduate students: 11 of them have more than 6,000. California Institute of Technology is unlike the rest in that it enrolls only 951 undergraduate students.
sc%>%filter(sat_avg<900)%>%select(instnm,sat_avg,ugds)%>%arrange(-ugds)
## # A tibble: 11 x 3
## instnm sat_avg ugds
## <chr> <dbl> <int>
## 1 California State University-San Bernardino 894 14373
## 2 New Jersey City University 835 6328
## 3 Grambling State University 851 4534
## 4 Albany State University 876 3988
## 5 University of Arkansas at Pine Bluff 784 3624
## 6 Delaware State University 868 3222
## 7 Central State University 759 2398
## 8 Mississippi Valley State University 825 2350
## 9 Kentucky State University 823 2326
## 10 Lincoln University 812 2020
## 11 Claflin University 895 1735
Of the 11 colleges whose average SAT score is lower than 900, nine have fewer than 5,000 undergraduate students. Overall, when comparing schools with very high and very low SAT scores, it appears that colleges with very high SAT scores (greater than 1400) tend to be larger.
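The comparison above counts schools against a cutoff. A summary statistic per group is another way to check it. The sketch below uses only the ugds values printed in the two tables above; the names `high_sat`, `low_sat`, `sizes`, and `size_summary` are mine, not part of the original analysis.

```r
library(dplyr)

# Enrollment figures copied from the two tables above
high_sat <- c(10842, 8905, 8367, 7743, 7181, 6868, 6764, 6564, 6436, 6416,
              6013, 5258, 5146, 5101, 5029, 4218, 4090, 3279, 2033, 951)
low_sat <- c(14373, 6328, 4534, 3988, 3624, 3222, 2398, 2350, 2326, 2020, 1735)

# Stack both groups into one data frame with a group label
sizes <- data.frame(
  sat_group = c(rep("sat_avg > 1400", length(high_sat)),
                rep("sat_avg < 900", length(low_sat))),
  ugds = c(high_sat, low_sat),
  stringsAsFactors = FALSE
)

# Count, median, and mean enrollment per SAT group
size_summary <- sizes %>%
  group_by(sat_group) %>%
  summarize(n = n(), median_ugds = median(ugds), mean_ugds = mean(ugds))

size_summary
```

With these figures, the median enrollment is 6214.5 for the high-SAT group and 3222 for the low-SAT group, which is consistent with the conclusion above.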
To examine the relationship between the cost of attendance and median student debt, I will use ggplot to create a graphics object, again named gg, using the dataset (‘sc’), the x variable (‘costt4_a’), and the y variable (‘debt_mdn’). I also specify the institution name (‘instnm’) as text. The last couple of lines make the scatterplot interactive, so that hovering the mouse cursor over a point shows which university it corresponds to. The interactive graphic is stored as “gg1.”
gg<-ggplot(data=sc, aes(x=costt4_a,y=debt_mdn,text=instnm))
gg<-gg+geom_point(alpha=.5,size=.5)
gg<-gg+xlab("Average Cost of Attendance")+ylab("Median Debt of Graduates")
gg
## Warning: Removed 1 rows containing missing values (geom_point).
gg_p<-ggplotly(gg)
gg_p
gg1<-gg_p
On first impression, one might presume that if the average cost of attendance is higher, then the median debt of graduates will also be higher. The scatterplot bears this out to some extent. Several universities whose average cost of attendance is around 50k have the highest median debt of graduates, around 16k to 17k. In fact, most schools whose graduates have a median debt of less than 15,000 dollars have an average cost of attendance of 40,000 dollars or less. Furthermore, schools with the lowest average cost of attendance, around 10,000 dollars, have the smallest median debt. At the extremes, graduates of the schools with the highest and lowest average cost of attendance have the highest and lowest median debt, respectively. This aligns with my initial assumptions.
There are a couple of elements in the scatterplot that surprise me. First, I did not expect such a wide range of median debt among the universities whose average cost of attendance is around 50,000 dollars. For example, the average cost of attending Harvard University is 50,250 dollars, but the median debt of its graduates is only 6,000 dollars. Yet students who graduate from the California Institute of the Arts, whose average cost of attendance is 48,784 dollars, finish with a median debt of 18,187.5 dollars. It would be interesting to explore to what extent graduates of each school received grants and scholarships to support their educational expenses, but this information is not provided in the current dataset.
Second, I did not anticipate that schools whose average cost of attendance is between 13,000 and 35,000 dollars would have graduates with such similar median debt. It baffles me that graduates of Albany State University have a median debt similar to that of graduates of Rocky Mountain College of Art and Design, even though attending Rocky Mountain College of Art and Design costs almost 21,000 dollars more on average than attending Albany State University.
The dataset includes three types of institutions: public, private non-profit, and private for-profit. This characteristic is recorded in the variable ‘control’.
control: control of institution, 1=public, 2= private non-profit, 3=private for-profit
The following code will tell me how many public, private non-profit, and private for-profit schools there are in this dataset.
table(sc$control)
##
## 1 2 3
## 35 83 7
There are 35 public, 83 private non-profit, and 7 private for-profit schools in this dataset.
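The numeric codes in the table output are easy to misread. As a sketch, the codes could be turned into a labeled factor before tabulating. The vector `control_codes` below is a stand-in rebuilt from the counts reported above, not the actual `sc$control` column, and the label spellings are my own.

```r
# Stand-in for sc$control, rebuilt from the counts above:
# 35 public, 83 private non-profit, 7 private for-profit
control_codes <- c(rep(1, 35), rep(2, 83), rep(3, 7))

# Attach readable labels to the numeric codes
control_labeled <- factor(
  control_codes,
  levels = c(1, 2, 3),
  labels = c("Public", "Private non-profit", "Private for-profit")
)

# Same tabulation as table(sc$control), now with labeled columns
counts <- table(control_labeled)
counts
```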
I am interested in examining the relationship between average cost of attendance and median debt of graduates for the public institutions only. To do this, I use both the ‘filter’ and ‘select’ commands. I first filter for public institutions. Within that group, I select the institution name, average cost of attendance, and median debt of graduates. The data are presented in descending order of cost. This subset of data is stored as ‘Public’.
sc%>%filter(control=="1")%>%
select(instnm,costt4_a,debt_mdn)%>%arrange(-costt4_a)
## # A tibble: 35 x 3
## instnm costt4_a debt_mdn
## <chr> <int> <dbl>
## 1 University of California-Berkeley 26275 12312.
## 2 University of California-Los Angeles 24725 11523
## 3 University of California-San Diego 23433 13394.
## 4 College of William and Mary 20806 13335
## 5 University of Virginia-Main Campus 20488 12000
## 6 Lincoln University 19126 15687
## 7 California Polytechnic State University-San Luis Obis… 18978 11958
## 8 SUNY at Purchase College 18116 14000
## 9 SUNY at Binghamton 17956 12625
## 10 State University of New York at New Paltz 17779 12500
## # … with 25 more rows
Public <- sc%>%filter(control=="1")%>%select(instnm,debt_mdn,costt4_a)
There are 35 public institutions in the dataset. This information can be shown visually in a scatterplot. The following code uses ggplot to create a graphics object. Adding “geom_point” as a layer tells the program to draw a scatterplot. The plot is stored as “gg_pub” for later use.
gg<-ggplot(data=Public,aes(x=costt4_a,y=debt_mdn,text=instnm))
gg<-gg+geom_point(alpha=.5,size=.5)
gg<-gg+xlab("Average Cost of Attendance")+ylab("Median Debt")
gg<-gg+ggtitle("Debt and Cost in Public Universities")
gg
gg_pub <-gg
The scatterplot above shows a positive relationship between average cost of attendance and median debt among the 35 public universities.
I am now interested in examining the relationship between cost and debt of graduates for the private non-profit institutions only. The following code filters for private non-profit institutions and saves them as a separate dataset named ‘privatenp’. The ggplot command then uses only the data from this group of 83 schools to create a scatterplot graphics object.
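A relationship read off a scatterplot can also be summarized with a correlation coefficient. The sketch below shows the call on made-up numbers. The data frame `demo` and its values are hypothetical, so the result says nothing about the actual public universities. On the real data, the same call would take `Public$costt4_a` and `Public$debt_mdn`.

```r
# Hypothetical cost and debt values, for illustration only (not from sc)
demo <- data.frame(
  costt4_a = c(10000, 15000, 20000, 25000, NA),
  debt_mdn = c(9000, 11000, 10500, 13000, 12000)
)

# use = "complete.obs" drops rows where either variable is missing
r <- cor(demo$costt4_a, demo$debt_mdn, use = "complete.obs")
r
```

A value of `r` above zero indicates that higher cost goes with higher debt in the sample. It does not show that cost causes debt.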
sc%>%filter(control=="2")%>%
select(instnm,costt4_a,debt_mdn)%>%arrange(-costt4_a)
## # A tibble: 83 x 3
## instnm costt4_a debt_mdn
## <chr> <int> <dbl>
## 1 Georgetown University 53425 14500
## 2 George Washington University 52707 15021
## 3 Washington University in St Louis 52464 14499
## 4 Middlebury College 52460 11250
## 5 University of Chicago 52450 13126
## 6 Vanderbilt University 52303 12625
## 7 Carnegie Mellon University 52150 17125
## 8 Northwestern University 52080 12500
## 9 Boston College 52007 17125
## 10 Wesleyan University 51935 16690
## # … with 73 more rows
privatenp <- sc%>%filter(control=="2")%>%select(instnm,debt_mdn,costt4_a)
gg<-ggplot(data=privatenp,aes(x=costt4_a,y=debt_mdn,text=instnm))
gg<-gg+geom_point(alpha=.5,size=.5)
gg<-gg+xlab("Average Cost of Attendance")+ylab("Median Debt")
gg<-gg+ggtitle("Debt and Cost in Private Non-Profit Universities")
gg
## Warning: Removed 1 rows containing missing values (geom_point).
gg_privatenp <-gg
This scatterplot also shows a positive relationship between average cost of attendance and median debt. It’s important to note that many of the private non-profit schools have an average cost of attendance of around 50k. And among these tightly grouped schools, there is a wide range of median debt.
The following code is used to generate the last plot, showing the relationship between median debt and average cost in the seven private for-profit schools. This plot is saved as gg_privateprof for later use.
sc%>%filter(control=="3")%>%
select(instnm,costt4_a,debt_mdn)%>%arrange(-costt4_a)
## # A tibble: 7 x 3
## instnm costt4_a debt_mdn
## <chr> <int> <dbl>
## 1 South University-The Art Institute of Dallas 40851 9500
## 2 Argosy University-The Art Institute of California-San … 35858 9616.
## 3 Schiller International University 35408 6500
## 4 Rocky Mountain College of Art and Design 34589 11562.
## 5 University of Advancing Technology 32054 11625
## 6 DigiPen Institute of Technology 23969 16125
## 7 The National Hispanic University 19135 5500
privateprof <- sc%>%filter(control=="3")%>%select(instnm,debt_mdn,costt4_a)
gg<-ggplot(data=privateprof,aes(x=costt4_a,y=debt_mdn,text=instnm))
gg<-gg+geom_point(alpha=.5,size=.5)
gg<-gg+xlab("Average Cost of Attendance")+ylab("Median Debt")
gg<-gg+ggtitle("Debt and Cost in Private For-Profit Universities")
gg
gg_privateprof <-gg
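The three plots above repeat the same code with a different filter. As an alternative sketch, the panels could be drawn in one call with `facet_wrap()`. The data frame `demo` and the names `control_label` and `gg_facets` are hypothetical, and the rows are made up rather than taken from `sc`; only the column names follow the dataset.

```r
library(ggplot2)

# Hypothetical rows standing in for sc; only the column names match the dataset
demo <- data.frame(
  instnm = c("A", "B", "C", "D", "E", "F"),
  control = c(1, 1, 2, 2, 3, 3),
  costt4_a = c(18000, 24000, 45000, 52000, 24000, 35000),
  debt_mdn = c(12000, 13000, 12500, 17000, 16000, 9500),
  stringsAsFactors = FALSE
)

# Turn the numeric codes into readable panel labels
demo$control_label <- factor(
  demo$control,
  levels = c(1, 2, 3),
  labels = c("Public", "Private non-profit", "Private for-profit")
)

# One scatterplot panel per institution type, sharing both axes
gg_facets <- ggplot(demo, aes(x = costt4_a, y = debt_mdn)) +
  geom_point(alpha = .5, size = .5) +
  facet_wrap(~control_label) +
  xlab("Average Cost of Attendance") +
  ylab("Median Debt") +
  ggtitle("Debt and Cost by Type of Institution")

gg_facets
```

Shared axes make the three groups directly comparable, at the cost of compressing the groups with a narrower cost range.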